[proof of concept] Add support for balancing at the directory level #47
This script works great for datasets that consist mainly of large files, where execution time is dominated by the time spent copying data between disks. However, when working with very large numbers of small files, effective throughput is very low, because most of the time is spent creating and waiting on extra processes (`grep`, `cp`, `rm`) and writing to stdout.
When trying to rebalance a large number of small files, I've found ~100x speedups by copying a directory with `cp -rax` rather than rebalancing each file individually (a rough sketch of the contrast is below). This PR is a proof of concept of adding support for this kind of rebalancing. If there's interest in adding this kind of functionality to this script, I can clean it up and get it into a state to be merged.
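To make the overhead concrete, here is a minimal sketch of the two approaches; the paths and the `.balance` suffix are illustrative assumptions, and the real script also checksums each copy before replacing the original:

```bash
# Per-file approach: forks cp and mv once per file, so process startup
# dominates when there are millions of small files.
find /pool/dataset/dir_b -type f | while IFS= read -r file; do
    cp -ax "$file" "${file}.balance"
    mv "${file}.balance" "$file"
done

# Directory-level approach: one recursive copy and one swap for the
# whole tree, avoiding the per-file process overhead.
cp -rax /pool/dataset/dir_b /pool/dataset/dir_b.balance
rm -rf /pool/dataset/dir_b
mv /pool/dataset/dir_b.balance /pool/dataset/dir_b
```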
The goal here is to allow easy rebalancing in cases where most data is in large numbers of small files, especially where an entire dataset is too large to duplicate, but individual folders are known to be small enough to copy without filling the pool.
As an example, consider a folder structure like the following, where the pool has 1 TB of capacity (the sizes shown are illustrative; the point is that the dataset as a whole is too large to duplicate, while each child fits):
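```
/pool/dataset          # ~600 GB used on a 1 TB pool (sizes illustrative)
├── dir_a/             # a handful of large files, ~150 GB
├── dir_b/             # millions of small files, ~150 GB
└── huge_file          # one very large file, ~300 GB
```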
As the pool is more than half full, we can't call `cp` on the whole dataset, or we'd run out of space. Calling this script on `/pool/dataset` would be too slow, because tracking and copying the files in `dir_b` would have too much overhead. You could call the script on `huge_file` and `dir_a` separately, but then you'd still have to find a way to rebalance `dir_b`.

On this branch, you can invoke the script with `--explicit-paths /pool/dataset/*`, which will make copies of `dir_a`, `dir_b`, and `huge_file` one at a time, without running out of space.
## How it works

The new behavior can be used by passing `--explicit-paths` to the script.
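For example, a hypothetical invocation might look like the following (the script filename here is an assumption; the `--explicit-paths` flag is what this PR adds):

```bash
# Rebalance a single directory in place (script name assumed).
./zfs-inplace-rebalancing.sh --explicit-paths /pool/dataset/dir_b
```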
With the `--explicit-paths` flag set, it also supports passing multiple paths, either explicitly or via globbing, as in the above example.
Instead of using `find` to generate a list of files to rebalance, the script directly uses the list of paths provided in the arguments. The `rebalance` function has been updated to handle copying and removing both directories and files (a sketch of that branching follows). Otherwise, the logic is largely unchanged; in particular, the handling of multiple passes is unchanged.
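As a hedged sketch of that update (the function name and `.balance` suffix are assumptions, not the PR's actual diff, and the real script also verifies each copy before deleting the original), the copy-and-replace step could branch like this:

```bash
# Hypothetical sketch of a directory-aware copy/replace step.
copy_and_replace() {
    local path="$1"
    local tmp="${path}.balance"   # assumed temporary-copy naming
    if [ -d "$path" ]; then
        # Directories: one recursive copy restricted to this filesystem.
        cp -rax "$path" "$tmp"
        rm -rf "$path"
    else
        # Regular files: same per-file flow as before.
        cp -ax "$path" "$tmp"
        rm -f "$path"
    fi
    mv "$tmp" "$path"
}
```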
## Limitations

There are a few limitations to this approach, both in general and in what has been done so far in this branch.
### Limitations to the idea in general
### Limitations in this branch (that should be fixed before merging)